Text semantic de-duplication algorithm based on keyword graph representation

doi:10.11772/j.issn.1001-9081.2022101495

Abstract

The web contains a large amount of redundant text with identical or similar semantics. Text de-duplication addresses the storage space wasted by redundant text and reduces unnecessary overhead in information extraction tasks. Traditional text de-duplication methods rely mostly on literal overlap, make poor use of semantic information, and cannot capture interactions between sentences that are far apart in a long text, so their de-duplication results are unsatisfactory. To address text semantic de-duplication, a semantic de-duplication algorithm based on keyword graph representation is proposed. First, semantic keyword phrases are extracted from a text pair, and the pair is represented as a graph whose vertices are the keyword phrases. Second, the vertices are encoded in several ways, and a Graph Attention Network (GAT) learns the relationships between vertices to produce a vector representation of the text-pair graph, which is used to judge whether the two texts are semantically similar. Finally, texts are de-duplicated according to the semantic similarity of the text pairs. Compared with traditional methods, the proposed algorithm makes effective use of the semantic information in the text, and its graph structure links distant sentences in a long text through the co-occurrence of keyword phrases, which increases the semantic interaction between sentences. Experiments show that the proposed algorithm outperforms Simhash, BERT (Bidirectional Encoder Representations from Transformers) fine-tuning, and CIG (Concept Interaction Graph) on two public datasets, CNSE (Chinese News Same Event) and CNSS (Chinese News Same Story). Its F1 score reaches 84.65% on CNSE and 90.76% on CNSS, indicating that the algorithm can effectively improve text de-duplication.

Key words: text semantic de-duplication, keyword extraction, text matching, graph representation, Graph Attention Network (GAT)
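
The graph construction and the final de-duplication step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the keyword phrases are already given (the paper's keyword extraction is not reproduced), uses naive punctuation-based sentence splitting, and replaces the vertex encoding and GAT classifier with a caller-supplied similarity predicate. The names `split_sentences`, `build_keyword_graph`, and `deduplicate` are hypothetical.

```python
import re
from itertools import combinations


def split_sentences(text):
    """Naive sentence splitter (assumption: '.', '!' and '?' end a sentence)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def build_keyword_graph(text_a, text_b, keywords):
    """Represent a text pair as a graph whose vertices are keyword phrases.

    Returns (vertices, edges):
      vertices maps each keyword to the (source, sentence_index) pairs of the
      sentences that mention it, where source is "a" or "b";
      edges holds a frozenset({kw1, kw2}) for every two keywords that
      co-occur in at least one sentence.
    """
    vertices = {kw: [] for kw in keywords}
    edges = set()
    for source, text in (("a", text_a), ("b", text_b)):
        for idx, sentence in enumerate(split_sentences(text)):
            lowered = sentence.lower()
            present = [kw for kw in keywords if kw.lower() in lowered]
            for kw in present:
                vertices[kw].append((source, idx))
            for kw1, kw2 in combinations(present, 2):
                edges.add(frozenset((kw1, kw2)))
    return vertices, edges


def deduplicate(texts, is_similar):
    """Greedy de-duplication: keep a text only if no kept text is similar.

    is_similar stands in for the learned text-pair classifier; any
    function (text, text) -> bool can be passed.
    """
    kept = []
    for text in texts:
        if not any(is_similar(text, other) for other in kept):
            kept.append(text)
    return kept


if __name__ == "__main__":
    text_a = ("The earthquake struck the coast at dawn. "
              "Officials opened shelters. "
              "Rescue teams reached the coast by noon.")
    text_b = ("Rescue teams arrived after the earthquake. "
              "Shelters were opened by officials.")
    keywords = ["earthquake", "coast", "rescue teams", "shelters"]
    vertices, edges = build_keyword_graph(text_a, text_b, keywords)
    # The "coast" vertex gathers the first and third sentences of text_a,
    # which are not adjacent in the text.
    print(vertices["coast"])
    print(sorted(sorted(edge) for edge in edges))
```

In this sketch a keyword vertex collects every sentence that mentions it, so sentences that are far apart in a long text meet at a shared vertex, and co-occurrence edges connect vertices whose keywords appear in the same sentence. A learned model would then operate on this graph instead of the placeholder predicate.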


CLC Number: TP391.1

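Jinyun WANG, Yang XIANG. Text semantic de-duplication algorithm based on keyword graph representation [J]. Journal of Computer Applications (official website), DOI: 10.11772/j.issn.1001-9081.2022101495.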

